Why PCA Looks Triangular (often in computational biology)

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

Abstract

In this notebook, we will discuss some peculiarities of data that we often work with in computational biology that often results in triangular looking PCA structures, however this does not always persist in tSNE / UMAP dimension reduction.

Table of Contents:

Triangular PCA plots everywhere in Life Sciences

Visualization of biological high-dimensional data is often performed with linear dimensionality reduction techniques such as PCA / MDS as well as non-linear tSNE and UMAP. Remarkably, across many data types in Life Sciences, such as microbial ecology, cell biology, medical and evolutionary genetics, one can observe PCA taking a particular triangular shape as in the figures below.

In the figure above, genetic variation differences across world's populations from the 1000 Genomes Project (1000G) are presented at the left, and microbial abundance differences across different human tissues from the Human Microbiome Project (HMP) is shown in the right. Also in single cell gene expression analysis the trianglar shapes of PCA plots are quite common as it can be see from the PCA on gene expression from Cancer Associated Fibroblasts (CAFs) below on the left, and Innate lymphoid cells (ILCs) on the right.

PCA on data from different probability distributions

It turns out that some data types in Life Sciences can produce triangle / wedge - shaped PCA plots because of the non-gaussian nature of the data, i.e. if the data was not properly transformed prior to computing PCA. To demonstrate this, let us generate random data matrices without any correlation using different probability distributions such as Poisson, Binomial etc. that are typical in Life Sciences.

Here, one can observe that purely normally distributed data without any correlation structure should give nearly spherical spread of the data points around the origin in a PCA plot. Data follwing uniform distribution produce a funny square-looking PCA, as if one pulls the gaussian circle (on the previous plot) in two orthogonal directions. In contrast, Poisson, Log-Normal and Negative Bionomial distributed data tend to form triangular or wedge-looking structures on a PCA plot. Even, Binomially distributed data can often give a PCA that looks like a wedge or triangle. The thing is that the simulated samples are supposed to be independent, i.e. not correlation was put inside the data matrices, and the samples are not supposed to form clusters. However, with some imagination, one can overinterpet the points in the corners of the trianguler / wedge PCA shapes as belonging to different clusters / populations.

The triangular / wedge - looking structures on random data can partly be explained by non-gaussianity of many types of data in Life Sciences that seems to violate a very fundamantal assumption of PCA. Namely, PCA can be viewed as approximating a data set with a product of two matrices (loadings and scores) with the following implicit minimization of the squared error between the data and the approximation. Minimization of the squared error implicitly means that one assumes the normal / gaussian distribution of the approximation error or residuals, which might be problematic to satisfy if the data matrix X does not follow gaussian distribution. One can ask whether it is ok at all to use PCA on non-normally distributed data? In practice, PCA is often run on non-normally distributed data providing some data transform has been applied prior to computing PCA.

It turns out that standardization and log-transform is typically applied to many Life Science data, that masks the non-gaussianity of the data and fulfills the normality assumption of the PCA. In practice, it results in a less pronounced tringular shape of the PCA plots. Let us apply log-transoform to the previously simulated data matrices and observe the difference in PCA plots.

We can observe that the spread of the data points in the PCA plots becomes more and more spherical once the log-transform and standardization has been applied. This intuitivaly means that there is no direction of preferred orientation of the data in the high-dimensional space, i.e. all directions are equally informative / important.

The above example dealt with purely random data without any correlation between samples. Often, there is a lot of correlation present in gene expression or microbial abundance data, i.e. samples or cells are not independent and form clusters. In this case, one can obtain meaningful triangular PCA plot where the corners of the triangle do mean different sample populations. One extreme case is explained below and is called the arch (or similar horseshoue) effect.

Data matrix with block structure and arch effect

The triangular shape of the PCA plots seems to have something to do with so-called arch or horseshoue effect, which is not particularly well studied in genetics or single cell analysis, to my experience, but is pretty well known in microbial ecology. One needs to check some old discussions to get more information about the two effects. There is also an excellent paper explaining in depth the origin of the horshoue effect that I highly recommend to read.

It turns out that one can get a triangular looking PCA when there is a strong correlation between statistical observations, i.e. when sampes / cells belong to distinct clusters / populations. In this case, the data matrix has a peculiar form: one can visually observe blocks of elements with large values for some samples and some variables while low values (zeros) are everywhere else in the matrix. Let us simulate a data matrix with two such blocks and visualize the elements of the matrix as a heatmap..

The data matrix simulated above contains 200 samples (S0,..., S199) and 200 variables (V0,...,V199). The darker area in the heatmap corresponds to zero elements of the matrix while the brighter colors correspond to larger values of the elements of the matrix. Therefore, the data matrix consists of two clear blocks / clusters / populations, where samples S0,...,S99 seem to be completely dicoupled from the samples S100,...,S199 seems they do not have common variables, i.e. variables V0,...,V99 have non-zero values only for the samples S0,...,S99, while variables V100,...,V199 have non-zero values only for the samples S100,...,S199. This is an extreme decoupling, for real world projects this is rarely the case, but this simulation will help us to build our intuition about what causes trianguler structures in PCA plots. In microbial ecology, people usually talk about environmentsl gradients that cause such block-looking data (pH, temperature etc.). By averaging elements across the variables in both blocks, one can indeed view this data matrix as if two non-interacting species are present in the data that do not share environment, i.e. one can't find a sample where both species are present.

Above, one can see that if samples happen to be ordered by some strong environmental gradient, this might result in a situation when certain species are abundant in a group of samples but completely absent in other groups. For example, soil microbiome might be very differnt from marine microbiome. In this case, it is generally become more and more problemetic to compute correct similarities between the samples from extremely different environments. However, here, for building our intuition, let us now compute a PCA on the block data matrix described above and visualize the variance explained by each pricipal component. For comparison, we will also provide a tSNE plot.

Here we can see that almost all variance in the data is captured by the first principal component. This is not surprising because we know that there is a very strong dominating groupping effect of two clusters among the samples. Therefore, one can think about the number k of leading PCs (explaining most of the variation in the data) as a quick way to estimate the number of clusters in the data as k+1 clusters.

Next, we can see that already with two-block matrix the famous triangular shape of PCA becomes quite visible (with some imagination though). We can clearly see two clusters in both PCA and tSNE plots. However, what happens if we increase the number of blocks in the data matrix? Let us simulate a data matrix with three blocks and visualize the variance explained by each PC as well as PCA and tSNE plots.

As one might expect, we see that two PCs explain most of the variance in the data, and the three clusters of samples are clearly visible in both PCA and tSNE plots. The situation, however, changes dramatically when the number of clusters increases, when we have more than 2-3 clusters. Let us now simulate a block matrix with 10 distinct populations of samples.

A remarkable thing is that we do not seem to be able to distinguish 10 clusters in the PCA plot, in fact a few populations seem to clump together so that no more than 4 clear clusters can be detected from the visual inspection of the PCA plot. This would mean that any clustering algorithm would have big troubles groupping samples to their populations based on just two principal components even if we provide the exact number of populations in the data. Therefore one will need to consider more then two principal components to get a correct clustering of samples, this however will be increasingly more and more difficult to visualize, i.e. will be more and more difficult to reach an agreement between dimensionality reduction and clustering. Strikingly, tSNE seems to easiy resolve all the 10 clusters with only two latent tSNE variables. Even if this is not a good idea, but in priciple, clustering on the 2D tSNE picture would be successful (i.e. one would easily group all samples to their 10 populations) for this particular data.

Data matrix with fuzzy block structure and horseshoue effect

The extreme block structure descibed in the previous section is rarely encountered in real world data. Usually, the samples clusters are less transparent, one might expect data matrices with more fuzzy block structure to be quite typical for Life Sciences projects. One way to add some "fuzziness" to the previously describes block data matrix would be a band data matrix, where only elements around diagonal are non-zero. Here, we still suspect some sample clusters but they are not as transparent as for the block data matrix. Let us simulate and visualize a typical band data matrix with two "fuzzy" blocks.

This kind of band matrix might appear in computation biology projects when there is a strong spatial or temporal ordering of the samples, usually they say that there is a gradient that affects the abundances in the samples. Examples of such gradient can be pH, temperature, spatial coordinate, time period of sampling, there can be also technical gradients such as batch (lab, technology, country). Using the previous intuition about species present in a particular environment but absent in others, we can again average matrix elements across groups of variables (that correspond to two "species") and observe a more gradual, i.e. much less abrupt, change of species 1 and 2 abundances when changing from one environment to the other. Below we can see that e.g. species 1 can also be present (althoug in lower abundance) in the environment that was previusly occupied solely by species 2.

What about a PCA for that band data matrix? Below we will compute the variance explained by principal components as well as visualize PCA and tSNE for the simulated band data matrix with presumably two fuzzy sample clusters.

Here we observe the famous arch or horseshoe in the PCA (and also tSNE) plot. The shape of the data points spread is still triangular, and the two "wings" of the arch / horseshoe correspond to two different clusters. Also, please note the gradual change in the variance explained by principal components plot. With the clean block structure matrix there were a few equally important leading PCs, while now there is one leading PC that typically correspond to a strong gradient (e.g. temporal) that dominates the values of the band data matrix. Typically, they say that samples in such PCA plots are ordered along the horseshour from early to late in time (temporal ordering) or from geographic west to geographic east (spatial ordering). In single cell analysis such ordered "threads" of cells are typical when monitoring stem cell differentiation, i.e. when cells are ordered along so-called cell fate trajectory. Typically, diffusion maps (another type of dimension reduction technique) are great at resolving cell trajectories and ar widely used instead of PCA when working with cell development in single cell biology.

The formation of the peculiar arch or horseshoe looking shapes in PCA plots has something to do with the Lissajous curves known in mathemtics. Those curves can be seen if we plot each principal component as a function of sample number when samples has been ordered along an environmental gradient e.g. time or spatial position. Let us check that those curves are visible:

Indeed, very interesting peculiar shapes! What will happen if we increase the number of "fuzzy" sample populations?

We can see that the ends of the horseshoue start bending towards the center of mass. Another thing that we can observe is that the darkgreen and magenta sample populations seem to be very distant from each other while having the red sample population as a common neighbor. This fact can be intuitively understood if we look at band data matrix and realize that the first 100 (darkgreen) and last 100 (magenta) samples hardly share any variables, while the red samples (100 samples in the middle of the band matrix) share plenty of V-variables with both magenta and darkgreen sample populations. Now, let us increase the number of "fuzzy" blocks even more and see what happens to the PCA and tSNE plots.

One obvios thing here is that the ends of the horseshoe in the PCA plot are now bent quite a lot toward the center of mass of the horseshoue. Therefore, here we have a counter-intuitive effect: the gold and darkgreen sample populations at the ends of the horseshoue seem to be much closer to each other than to e.g. cyan and blue sample populations in the middle of the horseshoue (and they will be approaching each other even more if we increase the number of "fuzzy" blocks in the data matrix). Now, imagine that there was a temporal gradient acting along the horseshoue, i.e. the samples were odered from early in time to later in time with darkgreen samples being youngest and gold samples being oldest while the cyan and blue samples are from some intermediate times. What this mutual approaching of the ends of the horseshoue means us that the youngest and the oldest samples become very similar to each other compared to the samples from the intermediate time, which is strange. One would expect that the youngest and the oldest samples are more dissimilar to each other than they both are with respect to the intermediate time samples.

The interpretation of the PCA horseshoue becomes even trickier when comparing genetic variation across world populations. If there is a spatial gradient (geographic separation) of samples where genetic variation was measured as the V-variables in the band data matrix, it might seem from the PCA plot that populations from far east are more genetically similar to the populations from far west than with respect to their common neighbours.